An efficient approach for sequence matching in large DNA databases

Authors

Jung-Im Won

Sanghyun Park

Jeehee Yoon

Sang-Wook Kim

Abstract

In molecular biology, DNA sequence matching is one of the most crucial operations. Because DNA databases contain a huge volume of sequences, fast indexes are essential for processing DNA sequence matching efficiently. In this paper, we first point out the problems of the suffix tree, an index structure widely used for DNA sequence matching, with respect to storage overhead, search performance, and the difficulty of seamless integration with a DBMS. We then propose a new index structure that resolves these problems. The proposed index structure consists of two parts: the primary part represents the trie as a binary bit string without any pointers, and the secondary part provides fast access to the trie's leaf nodes that must be visited during post-processing. We also present efficient algorithms for DNA sequence matching based on this index. To verify the superiority of the proposed approach, we evaluate its performance through a series of experiments. The results show that the proposed approach, which requires less storage space, can be a few orders of magnitude faster than the suffix tree.
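The abstract does not spell out the bit-string layout, so the following is only a minimal sketch of one common way to store a trie without pointers, not the authors' actual encoding. It assumes a breadth-first layout with four bits per node (one each for A, C, G, T) and finds a child by counting the set bits that precede it. The names `build_bit_trie`, `contains`, and the `depth` cutoff are hypothetical, and the secondary part of the index (leaf access for post-processing) is not modelled.

```python
from collections import deque

ALPHABET = "ACGT"
SYMBOL_INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def build_bit_trie(text, depth):
    """Encode the trie of all substrings of `text` up to `depth` symbols
    as a flat list of bits, four bits per node, in breadth-first order."""
    # Temporary nested-dict trie, used only during construction.
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:start + depth]:
            node = node.setdefault(symbol, {})

    # Breadth-first walk: each node emits one presence bit per symbol.
    bits = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for symbol in ALPHABET:
            child = node.get(symbol)
            bits.append(1 if child is not None else 0)
            if child is not None:
                queue.append(child)
    return bits


def contains(bits, pattern):
    """Return True if `pattern` labels a path from the root of the encoded trie."""
    node = 0  # the root is node 0 in breadth-first order
    for symbol in pattern:
        if symbol not in SYMBOL_INDEX:
            return False
        pos = 4 * node + SYMBOL_INDEX[symbol]
        if bits[pos] == 0:
            return False
        # The k-th set bit overall belongs to node k, so the child's number
        # is the count of set bits before `pos`, plus one for the root.
        # A linear scan keeps the sketch short; a precomputed rank table
        # would make this step constant time.
        node = sum(bits[:pos]) + 1
    return True
```

For example, `contains(build_bit_trie("ACGTAC", 3), "GTA")` walks three 4-bit groups and never follows a pointer. In this sketch, patterns longer than `depth` are simply reported as absent; handling them would need a verification step against the stored sequences.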

Similar resources

A Novel Assisted History Matching Workflow and its Application in a Full Field Reservoir Simulation Model

The significant increase in using reservoir simulation models poses significant challenges in the design and calibration of models. Moreover, conventional model calibration, history matching, is usually performed using a trial and error process of adjusting model parameters until a satisfactory match is obtained. In addition, history matching is an inverse problem, and hence it may have non-uni...

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...

A Parallel Architecture for DNA Matching

DNA sequences are often found as fragments, small pieces recovered at a crime scene or taken from a hair sample for a paternity test. To compare these fragments with the subject or target sequence of a suspect, we need an efficient tool for DNA sequence alignment and matching. DNA matching is thus a bioinformatics field that could find relationships functions between sequences, alignments ...

An Index based Pattern Matching using Multithreading

Pattern matching, the problem of finding subsequences within a long sequence, is essential for many applications such as information retrieval, disease analysis, structural and functional analysis, logic programming, theorem proving, term rewriting, and DNA computing. In computational biology, exact string matching algorithms are an essential component of DNA applications. Many databases li...

An Efficient Approach to Mining Maximal Contiguous Frequent Patterns from Large DNA Sequence Databases

Mining interesting patterns from DNA sequences is one of the most challenging tasks in bioinformatics and computational biology. Maximal contiguous frequent patterns are preferable for expressing the function and structure of DNA sequences and hence can capture the common data characteristics among related sequences. Biologists are interested in finding frequent orderly arrangements of motifs t...

Journal:

J. Information Science

Volume 32

Publication date: 2006
